
train: add simple loading already tokenized data from parquet dataset #14522


Open · lexasub wants to merge 1 commit into master from parquet2

Conversation

@lexasub (Contributor) commented Jul 3, 2025

We also need to add streaming/batching, but that is a more complex task. :)

@github-actions bot added the build (Compilation issues) and examples labels on Jul 3, 2025
@lexasub lexasub force-pushed the parquet2 branch 2 times, most recently from 1bb0911 to 2574024 Compare July 3, 2025 20:22
@lexasub lexasub marked this pull request as draft July 3, 2025 20:23
@lexasub (Contributor, Author) commented Jul 8, 2025

@JohannesGaessler what do you think about my changes? :)

@JohannesGaessler (Collaborator) commented:

Sorry for the late reply. Generally speaking I would greatly prefer it if the training data were to be stored as GGUF files. That will make my life as a maintainer much easier since I won't have to deal with external dependencies.

How about this: come up with a standardized way to define training data as GGUF, write code for constructing a ggml_opt_dataset from GGUF, and write code for converting text/Parquet to GGUF (this part can be Python). In the GGUF file, define one tensor for each sequence of characters or tokens. Streaming can be achieved by first loading only the metadata and then loading the tensor data as needed. I'm not yet 100% sure what the specification for the metadata should be.
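
For illustration, a minimal converter sketch in Python, assuming the gguf-py package that ships with llama.cpp and a Parquet file whose "tokens" column holds pre-tokenized sequences; the file name, column name, arch string, and metadata key are placeholders, not a finalized spec:

    # parquet_to_gguf.py - minimal sketch, not the final converter
    import numpy as np
    import pyarrow.parquet as pq
    import gguf  # gguf-py, bundled with llama.cpp

    table = pq.read_table("train.parquet")           # assumed input file
    sequences = table.column("tokens").to_pylist()   # assumed column: list of token-id lists

    writer = gguf.GGUFWriter("train.gguf", arch="dataset")      # arch string is a placeholder
    writer.add_uint32("dataset.num_sequences", len(sequences))  # placeholder key name

    # one I32 tensor per sequence, as suggested above
    for i, seq in enumerate(sequences):
        writer.add_tensor(f"seq_{i}", np.asarray(seq, dtype=np.int32))

    writer.write_header_to_file()
    writer.write_kv_data_to_file()
    writer.write_tensors_to_file()
    writer.close()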

@lexasub (Contributor, Author) commented Jul 8, 2025

Something like:

metadata: {
  "num_sequences": 100000,
  "vocab_size": 32000,
  "max_seq_len": 2048,
  "tokenizer": "llama"
}
tensors: [
  { "name": "seq_0", "shape": [2048], "data": [...] },
  { "name": "seq_1", "shape": [1536], "data": [...] },
  ...
]
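
Regarding the streaming idea mentioned above: gguf-py's GGUFReader memory-maps the file, so tensor names and shapes can be inspected without reading the token data. A minimal read-back sketch, assuming a file written with the layout above:

    # read_gguf_dataset.py - sketch of lazy access to the sequences
    from gguf import GGUFReader

    reader = GGUFReader("train.gguf")   # memory-maps the file

    # only tensor names and shapes are touched here; the data stays on disk
    for t in reader.tensors:
        print(t.name, list(t.shape))

    # materialize a single sequence on demand (numpy view backed by the memory map)
    seq_0 = reader.tensors[0].data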

@JohannesGaessler (Collaborator) commented:

Preferably use a prefix for the metadata and tensors; looking at llama-arch.cpp, how about e.g. dataset.num_sequences? Other than that, the layout LGTM.

@lexasub (Contributor, Author) commented Jul 8, 2025

We will use the training. prefix for all keys to avoid conflicts with model metadata.

Metadata

training.format.version: string (e.g. "1.0") - Specification version, in case of future changes.

training.dataset.name: string (optional) - Dataset name (e.g. "OpenWebText-ru").

training.dataset.source: string (optional) - URL or description of the data source.

training.file.creation_date: string (ISO 8601) - File creation date.

training.tokenizer.gguf.model: string - Tokenizer model name (llama, gpt2, etc.).

training.tokenizer.gguf.vocab: array[string] - Tokenizer dictionary.

training.tokenizer.gguf.merges: array[string] - Tokenizer merges (for BPE).

training.tokenizer.gguf.pre: string (optional) - Pre-tokenization architecture.

Note: Instead of storing the entire tokenizer, you could reference the model file, but embedding ensures that the data file is completely self-contained.

training.sequence.count: uint64 - Total number of sequences in the file.

training.sequence.lengths: array[uint32] - Key field! An array containing the length of each sequence in tokens. This will allow for efficient "bucketing" (grouping sequences of similar length) in the future.

Tensors

Naming: training.tensor.{index} (e.g. training.tensor.0, training.tensor.1, ...).

Data type: GGML_TYPE_I32 (standard for tokens in llama.cpp).

Shape: [sequence_length] - One-dimensional array. sequence_length will be different for each tensor.
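
As a concrete illustration of this layout, a short sketch that writes the proposed keys and tensors with gguf-py; all values are made up for the example, and the arch string is a placeholder:

    # write_training_dataset.py - sketch of the proposed key/tensor layout
    import numpy as np
    import gguf

    writer = gguf.GGUFWriter("dataset.gguf", arch="dataset")   # arch string is a placeholder

    writer.add_string("training.format.version", "1.0")
    writer.add_string("training.dataset.name", "OpenWebText-ru")
    writer.add_string("training.file.creation_date", "2025-07-08T00:00:00Z")
    writer.add_string("training.tokenizer.gguf.model", "llama")
    writer.add_uint64("training.sequence.count", 2)
    writer.add_array("training.sequence.lengths", [3, 2])

    # one I32 tensor per sequence, named training.tensor.{index}
    writer.add_tensor("training.tensor.0", np.array([1, 2, 3], dtype=np.int32))
    writer.add_tensor("training.tensor.1", np.array([4, 5], dtype=np.int32))

    writer.write_header_to_file()
    writer.write_kv_data_to_file()
    writer.write_tensors_to_file()
    writer.close()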

@JohannesGaessler (Collaborator) commented:

I don't think you need an array with the sequence lengths per tensor since you can just query the shape of a tensor. I think it's enough to store the maximum sequence length (could also get this from iterating over tensors).

Consider that people may also want to store untokenized datasets; I would suggest using uint8 plus metadata for the encoding in those cases (it's fine if this use case is not implemented in this PR).
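
For the untokenized case, one possible approach (sketch only; the encoding key name is hypothetical, and since ggml's tensor types include I8 but no unsigned 8-bit type, the raw UTF-8 bytes are stored as an I8 tensor here):

    # sketch: storing an untokenized sequence as raw bytes plus an encoding key
    import numpy as np
    import gguf

    writer = gguf.GGUFWriter("raw_text.gguf", arch="dataset")   # arch string is a placeholder
    writer.add_string("training.data.encoding", "utf-8")        # hypothetical key

    text = "raw, untokenized training text"
    raw = np.frombuffer(text.encode("utf-8"), dtype=np.int8).copy()
    writer.add_tensor("training.tensor.0", raw)

    writer.write_header_to_file()
    writer.write_kv_data_to_file()
    writer.write_tensors_to_file()
    writer.close()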

@lexasub (Contributor, Author) commented Jul 10, 2025

A quick-and-dirty implementation of a converter to the new format: https://github.com/lexasub/llama.cpp/tree/finetune-backup :)

@lexasub (Contributor, Author) commented Jul 10, 2025

@JohannesGaessler as a first step, I have added support for a GGUF dataset in #14622.

Labels: build (Compilation issues), examples